This document was creating using R Markdown. R Markdown is great tool to help organize and annotate your code, thus improving tranferability and reproducability of your methods. Code created in R Markdown can be “knitted” into HTML, PDF, and MS word documents. For this workshop, a corresponding HTML document has been created. Feel free to follow along using either the raw .Rmd document or the HTML document.
The purpose of this workshop is to provide a brief introduction to using the ggplot package for creating publication-quality figures. The ggplot package is part of the tidyverse. We will continue to practice using “tidy format” code in this workshop. This workshop will also demonstrate how to use base R to conduct basic statistical analyses, such as t-tests. Finally, we will practice using ggplot to display data and indicate statistically important information.
Disclaimer: I am not a statistician and this workshop is not intended to teach you how to choose the appropriate statistical approach for your research question. While I am always happy to discuss and help to the best of my ability, I suggest that for more complex statistical questions you consider deferring to the statisticians at the Centre for Health Care Innovation (CHI). All RFHS graduate students are entitled to 5 hours of free consultation at the CHI.
This workshop will use data from the E. Mick et. al. (2020) paper titled Upper airway gene expression reveals suppressed immune responses to SARS-CoV-2 compared with other respiratory viruses. Data is provided to you through the MMID github repository and can also be found GEO accession. This data set includes gene counts, cell type estimates, age, and sex information from 234 patients with acute respiratory infections, including COVID-19.
In today’s workshop we’ll recreate Figures 1c and 1d to learn ggplot, how to run statistics, and how to display signifance bars.
First, we must load the R packages that we need to conduct our analyses. I prefer to include these at the top of my code for easy reference.
library(tidyverse) # the tidyverse package also loads ggplot2
library(here) # the here package makes it easier to load and save files through simple directory calling
library(readxl) # read in xlsx files
library(GEOquery) # for accessing the geo accession data
library(biomaRt) # geo ensembl names
library(limma) # for running differential expression
library(edgeR) # for voom
# ggbeeswarm is for point plotting
# must download dev version from git for corrals to work
# devtools::install_github("eclarke/ggbeeswarm")
library(ggbeeswarm)
library(ggsignif) # significance bars
library(robustbase) # linear regressions in fig1d
library(ggeffects) # for calculating marginal effects
library(cowplot) # alternate to faceting for combining individual plots
Next, we have to read in our data.
# get geo data, which contains the metadata
gse_mat <- getGEO('GSE156063',GSEMatrix=TRUE)
pdata <- pData(phenoData(gse_mat[[1]]))
glimpse(pdata) # check size; quickly examine the pdata
## Rows: 234
## Columns: 50
## $ title <chr> "RR057e_00202_N05_S78", "RR057e_00080_H20_S3…
## $ geo_accession <chr> "GSM4721578", "GSM4721579", "GSM4721580", "G…
## $ status <chr> "Public on Aug 12 2020", "Public on Aug 12 2…
## $ submission_date <chr> "Aug 11 2020", "Aug 11 2020", "Aug 11 2020",…
## $ last_update_date <chr> "Aug 12 2020", "Aug 12 2020", "Aug 12 2020",…
## $ type <chr> "SRA", "SRA", "SRA", "SRA", "SRA", "SRA", "S…
## $ channel_count <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1",…
## $ source_name_ch1 <chr> "NP/OP swab", "NP/OP swab", "NP/OP swab", "N…
## $ organism_ch1 <chr> "Homo sapiens", "Homo sapiens", "Homo sapien…
## $ characteristics_ch1 <chr> "gender: F", "gender: M", "gender: F", "gend…
## $ characteristics_ch1.1 <chr> "age: 62", "age: 81", "age: 76", "age: 36", …
## $ characteristics_ch1.2 <chr> "sars-cov-2 rpm: NEG", "sars-cov-2 rpm: NEG"…
## $ characteristics_ch1.3 <chr> "sars-cov-2 pcr: 0.313853190805252", "sars-c…
## $ characteristics_ch1.4 <chr> "disease state: no virus", "disease state: n…
## $ treatment_protocol_ch1 <chr> "DNAse", "DNAse", "DNAse", "DNAse", "DNAse",…
## $ growth_protocol_ch1 <chr> "Clinical naso/pharyngeal swab specimens", "…
## $ molecule_ch1 <chr> "total RNA", "total RNA", "total RNA", "tota…
## $ extract_protocol_ch1 <chr> "Zymo DNA/RNA magbead", "Zymo DNA/RNA magbea…
## $ extract_protocol_ch1.1 <chr> "NEBnext with Qiagen Fastselect rRNA depleti…
## $ taxid_ch1 <chr> "9606", "9606", "9606", "9606", "9606", "960…
## $ description <chr> "RR057e_00202", "RR057e_00080", "RR057e_0028…
## $ data_processing <chr> "Alignment against human genome- Kallisto v …
## $ data_processing.1 <chr> "Gene-level counts were generated from the t…
## $ data_processing.2 <chr> "Fitering to retain samples with 400,000 gen…
## $ data_processing.3 <chr> "Differential expression with R package limm…
## $ data_processing.4 <chr> "Genome_build: HG-38, ENSEMBL v. 99", "Genom…
## $ data_processing.5 <chr> "Supplementary_files_format_and_content: .cs…
## $ platform_id <chr> "GPL24676", "GPL24676", "GPL24676", "GPL2467…
## $ contact_name <chr> "Charles,,Langelier", "Charles,,Langelier", …
## $ contact_email <chr> "chaz.langelier@ucsf.edu, chazlangelier@gmai…
## $ contact_phone <chr> "801-201-5049", "801-201-5049", "801-201-504…
## $ contact_department <chr> "Medicine", "Medicine", "Medicine", "Medicin…
## $ contact_institute <chr> "University of California San Francisco", "U…
## $ contact_address <chr> "499 Illinois St", "499 Illinois St", "499 I…
## $ contact_city <chr> "San Francisco", "San Francisco", "San Franc…
## $ contact_state <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "C…
## $ `contact_zip/postal_code` <chr> "94158", "94158", "94158", "94158", "94158",…
## $ contact_country <chr> "USA", "USA", "USA", "USA", "USA", "USA", "U…
## $ data_row_count <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ instrument_model <chr> "Illumina NovaSeq 6000", "Illumina NovaSeq 6…
## $ library_selection <chr> "cDNA", "cDNA", "cDNA", "cDNA", "cDNA", "cDN…
## $ library_source <chr> "transcriptomic", "transcriptomic", "transcr…
## $ library_strategy <chr> "RNA-Seq", "RNA-Seq", "RNA-Seq", "RNA-Seq", …
## $ relation <chr> "BioSample: https://www.ncbi.nlm.nih.gov/bio…
## $ supplementary_file_1 <chr> "NONE", "NONE", "NONE", "NONE", "NONE", "NON…
## $ `age:ch1` <chr> "62", "81", "76", "36", "58", "79", "89", "6…
## $ `disease state:ch1` <chr> "no virus", "no virus", "no virus", "SC2", "…
## $ `gender:ch1` <chr> "F", "M", "F", "F", "F", "F", "M", "F", "F",…
## $ `sars-cov-2 pcr:ch1` <chr> "0.313853190805252", "0.098334479421691", "0…
## $ `sars-cov-2 rpm:ch1` <chr> "NEG", "NEG", "NEG", "POS", "POS", "NEG", "N…
# the pdata has a lot of info not relevant to this workshop
# let's clean up naming and select only the relevant columns
# clean up pdata
pdata_sub <- pdata %>%
# mutate creates new columns based on existing columns
# also lets you change class of existing columns
mutate(sex = as.factor(gsub("gender: ", "", characteristics_ch1)),
age = as.numeric(gsub("age: ", "", characteristics_ch1.1)),
disease_state = `disease state:ch1`,
sarscov2_pcr = as.numeric(`sars-cov-2 pcr:ch1`),
sarscov2_rpm = as.factor(`sars-cov-2 rpm:ch1`),
participant_ID = description) %>%
mutate(disease_state = as.factor(gsub("[[:space:]]", "", disease_state))) %>%
# select specific columns
dplyr::select(participant_ID, sarscov2_pcr, sarscov2_rpm, disease_state, age, sex)
# read in the gene counts data
# this is included in the supplement geo accession data
# gene counts have already been filted
exp <- read.csv(file = here("input_data",
"GSE156063_swab_gene_counts.csv"))
dim(exp) # check size
## [1] 15979 235
# make ensembl id the rowname, then remove ensembl column
rownames(exp) <- exp$X
exp <- exp %>% dplyr::select(-X)
# the expression data has one more participant than pdata
# this individual should be removed
# subset for individuals with pdata info
exp_sub <- exp %>%
dplyr::select(pdata_sub$participant_ID)
# read in cell estimates
# this was supp data included with the paper
cells <- read_xlsx(here("input_data", "cell_estimates.xlsx"))
# lets participants are in the same order in all data frames
# this will make sure we dont accidentally analyze the wrong data
identical(colnames(exp_sub), pdata_sub$participant_ID) # TRUE
## [1] TRUE
identical(pdata_sub$participant_ID, cells$CZB_ID) # FALSE
## [1] FALSE
# reorder cells to match pdata
cells <- cells[order(match(cells$CZB_ID, pdata_sub$participant_ID,)),]
# recheck if they match now
identical(pdata_sub$participant_ID, cells$CZB_ID) # TRUE
## [1] TRUE
Limma is an R package developed to look at associations between exposures/outcomes of interest and microarray omic data such as gene expression or DNA methylation. I’ve recreated the author’s analysis based on the paper methods. I don’t have a lot of experience with voom so I followed this pipeline: https://ucdavis-bioinformatics-training.github.io/2018-June-RNA-Seq-Workshop/thursday/DE.html.
##############################################
# array wide DE analysis with limma and voom #
##############################################
# create model matrix - specified in paper
# intercept specified as 0
# this website does a good job of explaining how
# setting intercepts changes interpretation
# https://genomicsclass.github.io/book/pages/interactions_and_contrasts.html
# sex and age are covariates that may introduce noise into our data
# by including them in our model matrix we are removing that noise
mod <- model.matrix(~0 + disease_state + sex + age, pdata_sub)
# quantile normalization as per paper methods
# voom plot looks "good" - should look smooth
exp_norm <- voom(exp_sub, mod, normalize="quantile", plot = T)
# fit with limma
# fitting is based to you model matrix, we fit the gene count data
fit <- lmFit(exp_norm, mod)
head(coef(fit))
## disease_statenovirus disease_stateothervirus disease_stateSC2
## ENSG00000000003 7.184807 6.303187 7.324059
## ENSG00000000419 4.091305 3.785508 4.277102
## ENSG00000000457 5.144273 5.357076 5.338205
## ENSG00000000460 3.684206 3.664466 3.583829
## ENSG00000000938 3.443185 5.593719 2.864004
## ENSG00000000971 5.797583 6.348487 6.674473
## sexM age
## ENSG00000000003 -0.239161018 -0.0073830694
## ENSG00000000419 0.003434022 -0.0011280697
## ENSG00000000457 -0.015346096 -0.0031497622
## ENSG00000000460 0.024167447 -0.0007047909
## ENSG00000000938 0.725152380 0.0027183265
## ENSG00000000971 -0.116959082 -0.0047694672
##################
# make contrasts #
##################
# contrasts are matrices that tell limma what values to compare to one another
# we will make one for each of our comparisons
# covid vs no virus
sc_novirus <- makeContrasts(disease_stateSC2 - disease_statenovirus,
levels = colnames(coef(fit)))
sc_novirus
## Contrasts
## Levels disease_stateSC2 - disease_statenovirus
## disease_statenovirus -1
## disease_stateothervirus 0
## disease_stateSC2 1
## sexM 0
## age 0
# covid vs other virus
sc_other <- makeContrasts(disease_stateSC2 - disease_stateothervirus,
levels = colnames(coef(fit)))
sc_other
## Contrasts
## Levels disease_stateSC2 - disease_stateothervirus
## disease_statenovirus 0
## disease_stateothervirus -1
## disease_stateSC2 1
## sexM 0
## age 0
# other virus vs no virus
other_novirus <- makeContrasts(disease_stateothervirus - disease_statenovirus,
levels = colnames(coef(fit)))
other_novirus
## Contrasts
## Levels disease_stateothervirus - disease_statenovirus
## disease_statenovirus -1
## disease_stateothervirus 1
## disease_stateSC2 0
## sexM 0
## age 0
##############################
# fit contrast for each gene #
##############################
# we apply the constrats to our model fit to get info for each comparison
sc_novirus_contr_fit <- contrasts.fit(fit, sc_novirus)
sc_other_contr_fit <- contrasts.fit(fit, sc_other)
other_novirus_contr_fit <- contrasts.fit(fit, other_novirus)
###################################
# ebayes smooth for each contrast #
###################################
# the ebayes process smoothes the data
sc_novirus_ebayes <- eBayes(sc_novirus_contr_fit)
sc_other_ebayes <- eBayes(sc_other_contr_fit)
other_novirus_ebayes <- eBayes(other_novirus_contr_fit)
##################################################################
# top table adjusted for bejamini hochberg fdr for each contrast #
##################################################################
# the top table function orders the data according to p-value
# the authors adjusted for false discovery rate using the
# Benjamini Hochberg (BH) method
sc_novirus_top <- topTable(sc_novirus_ebayes, sort.by = "P",
adjust.method="BH", n = Inf)
sc_other_top <- topTable(sc_other_ebayes, sort.by = "P",
adjust.method="BH", n = Inf)
other_novirus_top <- topTable(other_novirus_ebayes, sort.by = "P",
adjust.method="BH", n = Inf)
# make unique name in each comparison for combining later
colnames(sc_novirus_top) <- paste("sc_novirus_",
colnames(sc_novirus_top), sep="")
colnames(sc_other_top) <- paste("sc_other_",
colnames(sc_other_top), sep="")
colnames(other_novirus_top) <- paste("other_novirus_",
colnames(other_novirus_top), sep="")
# match the row order of sc_other and other_novirus to sc_novirus
sc_other_top <- sc_other_top[order(match(rownames(sc_other_top),
rownames(sc_novirus_top))),]
other_novirus_top <- other_novirus_top[order(match(rownames(other_novirus_top),
rownames(sc_novirus_top))),]
# check that they are now in the same order
identical(rownames(sc_novirus_top), rownames(sc_other_top)) # TRUE
## [1] TRUE
identical(rownames(sc_novirus_top), rownames(other_novirus_top)) # TRUE
## [1] TRUE
# cbind top tables together
de <- cbind(sc_novirus_top, sc_other_top, other_novirus_top)
de$ensembl_gene_id <- rownames(de)
# get gene name annotation for ensembl
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
# get gene list
gene_list <- getBM(filters= "ensembl_gene_id",
attributes= c("ensembl_gene_id","hgnc_symbol"),
values=rownames(de),
mart= mart)
# add in gene symbol
de <- de %>%
right_join(gene_list, by = "ensembl_gene_id")
There is a myriad of plot types that are handled by ggplot and base R including line plots, bar plots, box plots, and point plots. Though the syntax of both methods is quite different, you can ultimately obtain the same plot using either method. However, I find the syntax of ggplot is easier to follow, and that the aesthetics of a ggplot are easier to modify.
Every ggplot begins with a call to ggplot(). In ggplot() we will specify the data we are using as well as basic aesthetics (aes()) such as which data belong on the x and y axis. From there, we can specify the type of plot we want, which is referred to as a geom (e.g. geom_boxplot()). We can also fine tune the look of the plot, including altering the legend and plot font size, colour, and font family using the theme() command. Ggplot has some built in themes as well. You can continue adding layers and modifying them until you achieve the plot you want! It’s easier to learn while doing, so this workshop will focus on practice. Each new layer is added to the plot using a “+”. You can add the layers all at once, or you can save your plot, add a new layer, and save again. Removing layers is not straightforward, so I prefer the latter. (If you are interested in removing layers yourself, this thread on stackoverflow is a great place to start.)
Let’s begin out introduction to ggplot by recreating figure 1c from the paper. This figure has a violin plot with points layered on top. The data is plotted separately for disease state and also coloured by disease state. We’ll start with the most simple violin plot.
Another thing to note is that our cell type data is in wide format. Ggplot works best with long data, so lets pivot to long format before plotting.
#######################
# wide to long format #
#######################
# convert cell type data to long format
# the subset cell types based on what was shown in the paper
cells_long <- cells %>%
# later on we will facet the data
# facet titles cannot be change within ggplot so the variable names must be
# changed themselves. this is why im renaming columns here
# the rename function lets you rename columns
rename("B cells" = B_cell,
"Macrophages" = Monocytes_macrophages,
"Dendritic cells" = Dendritic,
"Goblet cells" = Goblet ,
"Neutrophils" = Neutrophil ,
"Ciliated cells" = Ciliated) %>%
# im going to pivot all cell types to long format
pivot_longer(cols = 'B cells':T_cell,
names_to = "Cell_type",
values_to = "Estimates") %>%
# im going to select the cell types the authors included in fig 1c
subset(Cell_type %in% c('Ciliated cells', "Macrophages",
"Dendritic cells", 'Goblet cells', "Neutrophils",
"B cells")) %>%
# im goin gto rename viral status now to make it easier to read
mutate(Viral_status = factor(Viral_status,
levels = c("No virus", "SARS-CoV-2", "Other virus")),
Cell_type = factor(Cell_type, levels = c('Ciliated cells', "Macrophages",
"Dendritic cells", 'Goblet cells', "Neutrophils",
"B cells")))
# quick peek to make sure it looks okay - it does!
head(cells_long)
## # A tibble: 6 × 4
## CZB_ID Viral_status Cell_type Estimates
## <chr> <fct> <fct> <dbl>
## 1 RR057e_00202 No virus B cells 0.0487
## 2 RR057e_00202 No virus Ciliated cells 0.356
## 3 RR057e_00202 No virus Dendritic cells 0.0203
## 4 RR057e_00202 No virus Goblet cells 0.0512
## 5 RR057e_00202 No virus Macrophages 0.206
## 6 RR057e_00202 No virus Neutrophils 0
###############
# violin plot #
###############
# take a moment to explore the violin geom and what options you can control
# ?geom_violin
# basic ggplot using our long data
ggplot(data = cells_long,
aes(x=Cell_type, y = Estimates, fill = Viral_status)) +
geom_violin()
# Practice: try removing "violin" from the geom specification i.e. (geom_)
# and see what other plot suggestions pop-up
This plot is quite crowded. One thing we can do to improve the aesthetics is to facet our graph according to cell type. Faceting creates smaller graphs within a single pane to display the data. There are two facet functions: facet_wrap and facet_grid. They are very similar though facet_wrap tends to be better at utilizing screen space. You can specify the number of rows and columns if you wish.
# explore facet_wrap and facet_grid to see what arguments you can specify
# ?facet_wrap
# ?facet_grid
ggplot(data = cells_long, aes(x=Cell_type, y = Estimates, fill = Viral_status)) +
geom_violin() +
facet_wrap(~Cell_type) # facet the data
Since we have facetted, we no longer need cell types on the x-axis, we can replace this with viral status instead.
ggplot(data = cells_long, aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin() +
facet_wrap(~Cell_type)
In the paper, the authors have a white background and no box around the facet titles. We can start playing around with changing theme parameters to accomplish this. Ggplot has predefined themes that are helpful for changing “base” aesthetics. Lets try the default theme theme_bw().
ggplot(data = cells_long, aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin() +
facet_wrap(~Cell_type) +
theme_bw() # default theme
This is better but there is still a box around the title and gridlines. We can remove this using the theme() command. Within the theme command we can access specific parts of the plot such as the grid lines, axis lines, and backgrounds.
# theme has a lot of controls
# take a moment to explore what you can change in the plot using theme
# ?theme
ggplot(data = cells_long,
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin() +
facet_wrap(~Cell_type, scales = "free") +
theme_bw() +
theme(strip.background = element_blank(), # the box around the facet title
axis.line = element_line(colour = "black"), # leave axis lines black
panel.grid = element_blank(), # remove grid lines
panel.border = element_blank()) # remove border
Given that we’ve stated the viral status on the x axis, we don’t need a legend telling us what colour corresponds to what status. We can remove the legend in theme() by setting the legend position to “none”. This helps to clean up our graph. The most beautiful graphs are simple and clean.
ggplot(data = cells_long,
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin() +
facet_wrap(~Cell_type, scales = "free") +
theme_bw() +
theme(strip.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none") # remove legend
The colours in fig1c are not the default colours used by ggplot. In ggplot, the colour typically refers to the outline of plots and points, while fill corresponds to the colour inside the lines. Colours can be specified either using R colour names or hex values (e.g “#009999”). The list of accetptable R colour names can be accessed by typing colours().
Also notice that the violin plot appears to be transluscent. We can adjust the opacity of plots and points using alpha, which accepts a value from 0 to 1, where 0 is invisible and 1 is opaque.
Let’s adjust the fill, colour, and opacity of the violin plot to match the authors’.
# scale_fill_manual can control more than colour
# explore it to find out
# ?scale_fill_manual
ggplot(data = cells_long,
aes(x=Viral_status, y = Estimates,
fill = Viral_status, colour = Viral_status)) +
# there is no outline around the violin plots in fig 1c
# that means the authors set colour to "NA"
geom_violin(alpha = 0.5, color = NA) +
# scale_fill_manual allows us to specify colours
# factored varibles will be coloured in the order which they appear
scale_fill_manual(values=c("darkgrey", "skyblue", "indianred1")) +
facet_wrap(~Cell_type, scales = "free") +
theme_bw() +
theme(strip.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none")
The violins now looks a little broken at the top. The authors set their y axis limits at 1.5*IQR above the third quartile. Let’s revisit our data and use the tidy method to get the values we should use for our y axis limits. We can omit this data one of two ways. We can remove the data above the cut off, or we can change the y axis limits to exclude points above the cut off. Facet wrap doesnt allow users to specify x or y axis limits directly, but we can use a custom facet function (built off facet_wrap) that allows us to specify limits. For learning purposes we will both subset the data, and set the y-axis limits manually using the custom function.
#################
# y axis limits #
#################
# ?ylim
# get y lims based on iqr
iqr <- cells_long %>%
group_by(Cell_type, Viral_status) %>%
summarise(IQR(Estimates),
quantile(Estimates)[4]) %>%
mutate(ylim = `quantile(Estimates)[4]` +
1.5*`IQR(Estimates)`) %>%
group_by(Cell_type) %>%
summarise(max(ylim))
##########
# ggplot #
##########
# custom facet from function from
# https://fishandwhistle.net/post/2018/modifying-facet-scales-in-ggplot2/
# ive put these functions into a separate r script that we will source
source(here("scripts", "custom_facet_wrap.R"))
ggplot(data = cells_long %>%
subset(Cell_type == "Ciliated cells" & Estimates < iqr$`max(ylim)`[1] |
Cell_type == "Macrophages" & Estimates < iqr$`max(ylim)`[2] |
Cell_type == "Dendritic cells" & Estimates < iqr$`max(ylim)`[3] |
Cell_type == "Goblet cells" & Estimates < iqr$`max(ylim)`[4] |
Cell_type == "Neutrophils" & Estimates < iqr$`max(ylim)`[5] |
Cell_type == "B cells" & Estimates < iqr$`max(ylim)`[6]),
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin(alpha = 0.5, colour = NA, trim = T) +
scale_fill_manual(values=c("darkgrey", "skyblue", "indianred1")) +
# facet_wrap(~Cell_type, scales = "free") + # this is original facet function
# code for custom y axes on facets
facet_wrap_custom(~Cell_type, scales = "free", scale_overrides = list(
scale_override(1, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[1], 0.2))),
scale_override(2, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[2], 0.2))),
scale_override(3, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[3], 0.05))),
scale_override(4, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[4], 0.1))),
scale_override(5, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[5], 0.2))),
scale_override(6, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[6], .05))))) +
theme_bw() +
theme(strip.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none")
In fig1c, the median is annotated. We can add this in by specifying geom_violin to draw a quantile. However, an issue with calculating the median once we change the y axis limit is that the points which were excluded are no longer included in our median calculation. Therefore, we must calculate and draw the median line ourselves. Because we are not drawing hte median lines are selves, we can also get rid of the outline (colour) around the violin plots to make it more consistent with fig1.c
############
# quantile #
############
# grouping can be very helpful when running stats
# ?group_by
# ?add_column
# get median
med50 <- cells_long %>%
# group_by allows us to group data by multiple variables
# stats will be run seprately for each group
group_by(Cell_type, Viral_status) %>%
# summarize calculates specified summary stats
summarise(quantile(Estimates)[3]) %>%
# add_column lets us add a new column
add_column(x = rep(c(.75,1.75,2.75), 6),
xend = rep(c(1.25,2.25,3.25), 6))
########
# plot #
########
ggplot(data = cells_long %>%
subset(Cell_type == "Ciliated cells" & Estimates < iqr$`max(ylim)`[1] |
Cell_type == "Macrophages" & Estimates < iqr$`max(ylim)`[2] |
Cell_type == "Dendritic cells" & Estimates < iqr$`max(ylim)`[3] |
Cell_type == "Goblet cells" & Estimates < iqr$`max(ylim)`[4] |
Cell_type == "Neutrophils" & Estimates < iqr$`max(ylim)`[5] |
Cell_type == "B cells" & Estimates < iqr$`max(ylim)`[6]),
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin(alpha = 0.5, colour = NA, trim = T) +
scale_fill_manual(values=c("darkgrey", "skyblue", "indianred1")) +
# facet_wrap(~Cell_type, scales = "free") + # this is original facet function
# code for custom y axes on facets
facet_wrap_custom(~Cell_type, scales = "free", scale_overrides = list(
scale_override(1, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[1], 0.2))),
scale_override(2, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[2], 0.2))),
scale_override(3, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[3], 0.05))),
scale_override(4, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[4], 0.1))),
scale_override(5, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[5], 0.2))),
scale_override(6, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[6], .05))))) +
theme_bw() +
theme(strip.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none") +
geom_segment(data = med50, size = 1,
aes(y = `quantile(Estimates)[3]`, yend =`quantile(Estimates)[3]`,
x = x, xend=xend))
The authors also have points on top. We can add points using geom_point(). The goem for points should come before the geom_segment since the median line appears above both the violin plots and points. Geoms are always drawn in the order you specify. We will set the points to be coloured by viral status and use the same colours as the violins but greater opacity.
# to learn more about geom_segment
# ?geom_segment
ggplot(data = cells_long %>%
subset(Cell_type == "Ciliated cells" & Estimates < iqr$`max(ylim)`[1] |
Cell_type == "Macrophages" & Estimates < iqr$`max(ylim)`[2] |
Cell_type == "Dendritic cells" & Estimates < iqr$`max(ylim)`[3] |
Cell_type == "Goblet cells" & Estimates < iqr$`max(ylim)`[4] |
Cell_type == "Neutrophils" & Estimates < iqr$`max(ylim)`[5] |
Cell_type == "B cells" & Estimates < iqr$`max(ylim)`[6]),
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin(alpha = 0.5, colour = NA, trim = T) +
geom_point(aes(colour = Viral_status)) + # add the points
scale_fill_manual(values=c("darkgrey", "skyblue", "indianred1")) +
scale_colour_manual(values=c("darkgrey", "skyblue", "indianred1")) +
facet_wrap_custom(~Cell_type, scales = "free", scale_overrides = list(
scale_override(1, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[1], 0.2))),
scale_override(2, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[2], 0.2))),
scale_override(3, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[3], 0.05))),
scale_override(4, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[4], 0.1))),
scale_override(5, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[5], 0.2))),
scale_override(6, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[6], .05))))) +
theme_bw() +
theme(strip.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none") +
geom_segment(data = med50, size = 1,
aes(y = `quantile(Estimates)[3]`, yend =`quantile(Estimates)[3]`,
x = x, xend=xend))
As you can see, adding points with geom_point isnt very informative because there is a great degree of overlap. It’s possible to plot the points in a way so there is less overlap, making the data more clear. It looks as though the authors have using a beeswarm plot. I’ve included (but commented out) two other ways to accomplish similar looks using geom_jitter and geom_dotplot.
# for more information on jittering and dotplots
# ?geom_jitter
# ?geom_dotplot
ggplot(data = cells_long %>%
subset(Cell_type == "Ciliated cells" & Estimates < iqr$`max(ylim)`[1] |
Cell_type == "Macrophages" & Estimates < iqr$`max(ylim)`[2] |
Cell_type == "Dendritic cells" & Estimates < iqr$`max(ylim)`[3] |
Cell_type == "Goblet cells" & Estimates < iqr$`max(ylim)`[4] |
Cell_type == "Neutrophils" & Estimates < iqr$`max(ylim)`[5] |
Cell_type == "B cells" & Estimates < iqr$`max(ylim)`[6]),
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin(alpha = 0.5, colour = NA, trim = T) +
# geom_dotplot(binaxis='y',
# stackdir = "center",
# binwidth = 1/1000,
# aes(col = Viral_status)) +
#geom_jitter(height = 0, aes(color = Viral_status)) +
geom_beeswarm(aes(color = Viral_status),
corral = "gutter", # this takes overlapping points and pushes them towards the centre
corral.width = 0.7,
size = 1,
cex = 2.5) + scale_fill_manual(values=c("darkgrey", "skyblue", "indianred1")) +
scale_colour_manual(values=c("darkgrey", "skyblue", "indianred1")) +
facet_wrap_custom(~Cell_type, scales = "free", scale_overrides = list(
scale_override(1, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[1], 0.2))),
scale_override(2, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[2], 0.2))),
scale_override(3, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[3], 0.05))),
scale_override(4, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[4], 0.1))),
scale_override(5, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[5], 0.2))),
scale_override(6, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[6], .05))))) +
theme_bw() +
theme(strip.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none") +
geom_segment(data = med50, size = 1,
aes(y = `quantile(Estimates)[3]`, yend =`quantile(Estimates)[3]`,
x = x, xend=xend))
The text is a little messy on the graph. Let’s format this to look better. We can change the size, font, and spacing.
# element_text gives us lots of control over text properties
# check out which ones you might want to altering using this argument
# ?element_text
ggplot(data = cells_long %>%
subset(Cell_type == "Ciliated cells" & Estimates < iqr$`max(ylim)`[1] |
Cell_type == "Macrophages" & Estimates < iqr$`max(ylim)`[2] |
Cell_type == "Dendritic cells" & Estimates < iqr$`max(ylim)`[3] |
Cell_type == "Goblet cells" & Estimates < iqr$`max(ylim)`[4] |
Cell_type == "Neutrophils" & Estimates < iqr$`max(ylim)`[5] |
Cell_type == "B cells" & Estimates < iqr$`max(ylim)`[6]),
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin(alpha = 0.5, colour = NA, trim = T) +
geom_beeswarm(aes(color = Viral_status),
corral = "gutter", # this takes overlapping points and pushes them towards the centre
corral.width = 0.7,
size = 1,
cex = 2.5) + scale_fill_manual(values=c("darkgrey", "skyblue", "indianred1")) +
# custom x axis labels with new lines
scale_x_discrete(labels = c("No \nvirus", "SARS-\nCoV-2", "Other \nvirus")) +
# change y axis name
labs(y = "Cell type fraction") +
scale_colour_manual(values=c("darkgrey", "skyblue", "indianred1")) +
facet_wrap_custom(~Cell_type, scales = "free", scale_overrides = list(
scale_override(1, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[1], 0.2))),
scale_override(2, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[2], 0.2))),
scale_override(3, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[3], 0.05))),
scale_override(4, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[4], 0.1))),
scale_override(5, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[5], 0.2))),
scale_override(6, scale_y_continuous(breaks = seq(0, iqr$`max(ylim)`[6], .05))))) +
theme_bw() +
theme(strip.background = element_blank(),
strip.text = element_text(size=12), # change facet title size
axis.text.x = element_text(size=10), # change x axis text size
axis.text.y = element_text(size=10), # change y axis text size
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none",
axis.title.x = element_blank()) + # remove x title
geom_segment(data = med50, size = 1,
aes(y = `quantile(Estimates)[3]`, yend =`quantile(Estimates)[3]`,
x = x, xend=xend))
Pairwise comparisons between patient groups were performed with a two-sided Mann–Whitney–Wilcoxon test (non-parametric) followed by Bonferroni’s correction. For two sample comparisons like t-tests and the mann-whitney, data can be in either long or wide format. We can either use the ggsignif package to run the MW tests and also add significance bars or do this using geom_segment and annotate. Alternatively, some individuals opt to draw the significance bars in themselves. If you look closely at fig1c, it seems the authors have gone with the later. Because we’ve excluded points by changing the y limits, the stat function in geom-signif will not consider all of all data so we should calculate significance our this this ourselves. We can do some data manipulation using tidyverse to make this easier.
The mann whiteney tests
# almost all base R stats are easy to call
# the most important this is to make sure your data is formatted properly
# and that you are making the comparisons you think you are
# for more information on stats
# ?mann.whitney
# ?t.test
# ?ks.test
# ?anova
# etc....
# sars-cov-2 vs none
sc2_other_mw <- cells_long %>%
# subset for the viral statuses we are comparing
subset(Viral_status %in% c("SARS-CoV-2", "Other virus")) %>%
# group according to the comparisons
group_by(Cell_type, Viral_status) %>%
# summarize estimates into list to clean up
summarise(value = list(Estimates)) %>%
# make columns equal to lists of estimates for viral status (rows are cell type)
spread(Viral_status, value) %>%
# group by cell type
group_by(Cell_type) %>%
# mann whitney tests
mutate(pvalue = wilcox.test(x=unlist(`SARS-CoV-2`),
y=unlist(`Other virus`),
# multiple pvalue by 3 for bonferroni correction
# since there are 3 comparisons for each group
alternative = "two.sided" )$p.value*3) %>%
# add columns to specify formatting for line segments
add_column(xmin = 2.1,
xmax = 2.9,
x = 2.5,
y_position = c(0.76, 0.6, 0.15, 0.23, 0.6, 0.15),
y_position_text = c(0.805, 0.635, 0.16, 0.245, 0.635, 0.16),
Viral_status = "SARS-CoV-2",
comparison = "sc2_other")
# sars-cov-2 vs non3
sc2_none_mw <- cells_long %>%
subset(Viral_status %in% c("SARS-CoV-2", "No virus")) %>%
group_by(Cell_type, Viral_status) %>%
summarise(value = list(Estimates)) %>%
spread(Viral_status, value) %>%
group_by(Cell_type) %>%
mutate(pvalue = wilcox.test(x=unlist(`SARS-CoV-2`),
y=unlist(`No virus`),
alternative = "two.sided" )$p.value*3) %>%
add_column(xmin = 1.1,
xmax = 1.9,
x = 1.5,
y_position = c(0.76, 0.6, 0.15, 0.23, 0.6, 0.15),
y_position_text = c(0.805, 0.635, 0.16, 0.245, 0.635, 0.16),
Viral_status = "Other virus",
comparison = "sc2_none")
# other vs none
none_other_mw <- cells_long %>%
subset(Viral_status %in% c("Other virus", "No virus")) %>%
group_by(Cell_type, Viral_status) %>%
summarise(value = list(Estimates)) %>%
spread(Viral_status, value) %>%
group_by(Cell_type) %>%
mutate(pvalue = wilcox.test(x=unlist(`Other virus`),
y=unlist(`No virus`),
alternative = "two.sided" )$p.value*3) %>%
add_column(xmin = 1.1,
xmax = 2.9,
x = 2,
y_position = c(0.88, 0.7, 0.18, 0.27, 0.72, 0.18),
y_position_text = c(0.925, 0.735, 0.19, 0.285, 0.755, 0.19),
Viral_status = "No virus",
comparison = "other_none")
# rbind together
mw_test <- rbind(sc2_other_mw %>% dplyr::select(Cell_type, pvalue:comparison),
sc2_none_mw %>% dplyr::select(Cell_type, pvalue:comparison),
none_other_mw %>% dplyr::select(Cell_type, pvalue:comparison))
The plot with error bars
ggplot(data = cells_long %>%
subset(Cell_type == "Ciliated cells" & Estimates < iqr$`max(ylim)`[1] |
Cell_type == "Macrophages" & Estimates < iqr$`max(ylim)`[2] |
Cell_type == "Dendritic cells" & Estimates < iqr$`max(ylim)`[3] |
Cell_type == "Goblet cells" & Estimates < iqr$`max(ylim)`[4] |
Cell_type == "Neutrophils" & Estimates < iqr$`max(ylim)`[5] |
Cell_type == "B cells" & Estimates < iqr$`max(ylim)`[6]),
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin(alpha = 0.5, colour = NA, trim = T) +
geom_beeswarm(aes(color = Viral_status),
corral = "gutter", # this takes overlapping points and pushes them towards the centre
corral.width = 0.7,
size = 1,
cex = 2.5) + scale_fill_manual(values=c("darkgrey", "skyblue", "indianred1")) +
# custom x axis labels with new lines
scale_x_discrete(labels = c("No \nvirus", "SARS-\nCoV-2", "Other \nvirus")) +
# change y axis name
labs(y = "Cell type fraction") +
scale_colour_manual(values=c("darkgrey", "skyblue", "indianred1")) +
facet_wrap_custom(~Cell_type, scales = "free", scale_overrides = list(
scale_override(1, scale_y_continuous(breaks = seq(0, 1, 0.2),
limits = c(0,1))),
scale_override(2, scale_y_continuous(breaks = seq(0, 0.8, 0.2),
limits = c(0,0.8))),
scale_override(3, scale_y_continuous(breaks = seq(0, 0.2, 0.05),
limits = c(0,0.2))),
scale_override(4, scale_y_continuous(breaks = seq(0, 0.3, 0.1),
limits = c(0,0.3))),
scale_override(5, scale_y_continuous(breaks = seq(0, 0.8, 0.2),
limits = c(0,0.8))),
scale_override(6, scale_y_continuous(breaks = seq(0, 0.2, .05),
limits = c(0,0.2))))) +
theme_bw() +
theme(strip.background = element_blank(),
strip.text = element_text(size=12), # change facet title size
axis.text.x = element_text(size=10), # change x axis text size
axis.text.y = element_text(size=10), # change y axis text size
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none",
axis.title.x = element_blank()) + # remove x title
geom_segment(data = med50, size = 1,
aes(y = `quantile(Estimates)[3]`, yend =`quantile(Estimates)[3]`,
x = x, xend=xend)) +
geom_segment(data = mw_test, size = 0.5,
aes(y = y_position, yend =y_position,
x = xmin, xend=xmax)) +
# signif rounds and uses scientific notation where required
geom_text(data = mw_test,
aes(x = x, y = y_position_text, label = signif(pvalue, 2)), size = 3.5)
# For practice, try adding error bars to the plot using the geom_signif function
# ?geom_signif
# hints
# https://stackoverflow.com/questions/56219468/using-ggsignif-with-grouped-bar-graphs-and-facet-wrap-not-working
# https://stackoverflow.com/questions/45136550/how-to-annotate-different-values-for-each-facet-bar-plot-on-r
# hint: it's not going to be perfect
We are so close. However, our violin plots look a little skinny. The violin plot geom has an option to modify how the scale is calculated. If the scale is set to “area” (default) all violins have the same area (before trimming the tails). If the scale is set to “count” areas are scaled proportionally to the number of observations, and if the scale is set to “width” all violins have the same maximum width. Based on the fig 1c, it looks like the authors used scale set to “width”. Looking closely it also looks like their points might be a little transparent. We will set alpha of the beeswarm points as well.
# we looked at this before
# but taking a second look tells us we might have missed something in the scale
# ?geom_violin
ggplot(data = cells_long %>%
subset(Cell_type == "Ciliated cells" & Estimates < iqr$`max(ylim)`[1] |
Cell_type == "Macrophages" & Estimates < iqr$`max(ylim)`[2] |
Cell_type == "Dendritic cells" & Estimates < iqr$`max(ylim)`[3] |
Cell_type == "Goblet cells" & Estimates < iqr$`max(ylim)`[4] |
Cell_type == "Neutrophils" & Estimates < iqr$`max(ylim)`[5] |
Cell_type == "B cells" & Estimates < iqr$`max(ylim)`[6]),
aes(x=Viral_status, y = Estimates, fill = Viral_status)) +
geom_violin(alpha = 0.4, colour = NA, trim = T, scale = "width") +
geom_beeswarm(aes(color = Viral_status),
corral = "gutter", # this takes overlapping points and pushes them towards the centre
corral.width = 0.7,
size = 1,
cex = 2.5,
alpha = 0.8) +
scale_fill_manual(values=c("darkgrey", "skyblue", "indianred1")) +
# custom x axis labels with new lines
scale_x_discrete(labels = c("No \nvirus", "SARS-\nCoV-2", "Other \nvirus")) +
# change y axis name
labs(y = "Cell type fraction") +
scale_colour_manual(values=c("darkgrey", "skyblue", "indianred1")) +
facet_wrap_custom(~Cell_type, scales = "free", scale_overrides = list(
scale_override(1, scale_y_continuous(breaks = seq(0, 1, 0.2),
limits = c(0,1))),
scale_override(2, scale_y_continuous(breaks = seq(0, 0.8, 0.2),
limits = c(0,0.8))),
scale_override(3, scale_y_continuous(breaks = seq(0, 0.2, 0.05),
limits = c(0,0.2))),
scale_override(4, scale_y_continuous(breaks = seq(0, 0.3, 0.1),
limits = c(0,0.3))),
scale_override(5, scale_y_continuous(breaks = seq(0, 0.8, 0.2),
limits = c(0,0.8))),
scale_override(6, scale_y_continuous(breaks = seq(0, 0.2, .05),
limits = c(0,0.2))))) +
theme_bw() +
theme(strip.background = element_blank(),
strip.text = element_text(size=12), # change facet title size
axis.text.x = element_text(size=10), # change x axis text size
axis.text.y = element_text(size=10), # change y axis text size
axis.line = element_line(colour = "black"),
panel.grid = element_blank(),
panel.border = element_blank(),
legend.position = "none",
axis.title.x = element_blank()) + # remove x title
geom_segment(data = med50, size = 1,
aes(y = `quantile(Estimates)[3]`, yend =`quantile(Estimates)[3]`,
x = x, xend=xend)) +
geom_segment(data = mw_test, size = 0.5,
aes(y = y_position, yend =y_position,
x = xmin, xend=xmax)) +
# signif rounds and uses scientific notation where required
geom_text(data = mw_test,
aes(x = x, y = y_position_text, label = signif(pvalue, 2)), size = 3.5)
And that’s it! That’s about as close as we can get to the original plot. You may be asking why we can’t replicate it directly and that’s because the authors likely finished editing the figure in powerpoint after making the plot. The main clue pointing to this is the fact that their significance bars are above their plot - something that R cannot do. Additionally, they have to y-axis titles - one for the top half and one for the bottom. While it’s possible to accomplish this in R, it’s more likely it was done after the fact. It’s okay to do some figure editing (like adding in titles and significance comparisons) after, but doing more in R helps to ensure reproducibility!
Today we covered faceting, but it possible to recreate a multi-panel figure without faceting using the R package “cowplot”. Cowplot takes individual plots and combines them into a paneled figure. For practice, try recreating figure 1C using cowplot instead the facet geom. Cowplot is nice because it lets you control the aspect ratio of figures compared to each other. Additionally, since we are using cowplot, it will now be easier to incorporate error bars using ggsignif. Try practicing using this as well!
# ?plot_grid
# ?ggsignif
# hint - you should annotate the significance bars yourself!
# plotting hints:
# start with individual plots
# ciliated_plot <- ggplot() + ggsignif()
# macrophage_plot <- ggplot() + ggsignif()
# dendritic_plot <- ggplot() + ggsignif()
# goblet_plot <- ggplot() + ggsignif()
# neutrophil_plot <- ggplot() + ggsignif()
# bcell_plot <- ggplot() + ggsignif()
# then combine in cowplot
# plot_grid(ciliated_plot, macrophage_plot, dendritic_plot,
# goblet_plot, neutrophil_plot, bcell_plot,
# nrow = 2, ncol = 3)
A volcano plot is commonly used to display the results from limma. For gene expression data, the x-axis is typically represents log2 of gene expression and the y-axis represents the log10 of unadjusted p-values. Often, points which are considered significant based on p-value and differential expression are highlighted by using a different colour.
As an example, Figures 1A and 1 B the paper titled “Comparative transcriptome analyses reveal genes associated with SARS-CoV-2 infection of human lung epithelial cells” demonstrate volcano plots with significant points coloured.
Try making a volcano plot for the comparison between SARS-CoV-2 infections and health controls (no virus).
#ggplot() +
# geom_point() +
# geom_hline() + # add in a line for p value significance
# geom_vline() # add in a line for gene expression significance